The impact of data imputation on air quality prediction problem

With rising environmental concerns, accurate air quality predictions have become paramount as they help in planning preventive measures and policies for potential health hazards and environmental problems caused by poor air quality. Most of the time, air quality data are time series data. However, due to various reasons, we often encounter missing values in datasets collected during data preparation and aggregation steps. The inability to analyze and handle missing data will significantly hinder the data analysis process. To address this issue, this paper offers an extensive review of air quality prediction and missing data imputation techniques for time series, particularly in relation to environmental challenges. In addition, we empirically assess eight imputation methods, including mean, median, kNNI, MICE, SAITS, BRITS, MRNN, and Transformer, to scrutinize their impact on air quality data. The evaluation is conducted using diverse air quality datasets gathered from numerous cities globally. Based on these evaluations, we offer practical recommendations for practitioners dealing with missing data in time series scenarios for environmental data.


Introduction
Air helps sustain human life, so air tracking and understanding its quality is essential for our health.Air pollutants can pose significant threats to public health, and sources of air pollution can come from nature, such as smoke from volcano eruptions or forest fires, methane from animals' process of digesting food, or radon gas from radioactive decay in the earth's crust.In addition, pollution can also come from manufacturing activities such as industry and agriculture.They emit CO 2 , CO, SO 2 , NO 2 , and other organic substances at extremely high concentrations, polluting the air.Besides, burning fossil fuels yields climate change and air pollution.Therefore, air quality has still been a concern in recent years.Consequently, environmental researchers mine air quality data to uncover potential value and information from these data, thereby capturing user behavior, estimating disease causes, discovering gases, detecting individual actions to reduce greenhouse gases, acid rain, etc., and then advising management agencies and local governments to plan related policies.By using machine learning techniques, the local air quality data can be analyzed using sensors that gather real-time humidity and temperature readings.Duong et al. (2021) effectively extracted pertinent features from a dataset.They applied machine learning models to forecast AQI (Air Quality Indexing) values and levels at any user-specified location in Ho Chi Minh City [1].The dataset includes data on six atmospheric pollutants: SO 2 , NO 2 , PM10, PM2.5, CO, and O 3 , collected by volunteers who traversed predetermined routes to provide ground-truth AQI levels.
Many air quality data are in the form of time series and can contain missing values due to corrupted sensors, loss of electricity, etc.In such cases, data imputation, i.e., filling in missing values with some reasonable value according to some criteria, is a conventional practice to resolve the issue.The quality of imputation can significantly impact the downstream classification or prediction task.One can characterize missing data into three types: missing completely at random (MCAR), where the missing values are independent of any other values; missing at random (MAR), where missing values depend only on observed values; and missing not at random (MNAR), where missing values depends on both observed and unobserved values [2].There are many methods to deal with missing values based on the missing data mechanism.This work focuses on the MAR case, as it is prevalent for sensor data related to the environment [3].Furthermore, most air quality observation data are time series data.Dealing with missing values in time series data is often difficult, time-consuming, and labor-intensive.In addition, the missing data can significantly affect the processing and analysis of data.Therefore, handling missing values in time-series air quality data is necessary.
People can reveal critical enhancements regarding performance and running time by examining newly introduced approaches for data imputation.Multiple imputations can be further applied with these imputation methods to reduce the uncertainty by repeating the imputation procedure numerous times and averaging the results.Combining the imputation methods with forecasting models often results in a two-step process where imputation and forecasting models are separated.By doing this, the missingness is effectively explored in the forecasting model, thus leading to suboptimal analysis results.In addition, many imputation methods also have other requirements that may not be satisfied in real applications; for example, many of them work on data with low missing rates only, assume the data is missing randomly or completely at random, or can not handle time series data with varying lengths [4].Moreover, training and applying these imputation methods are usually computationally expensive.
Various imputation techniques have been proposed to fill in missing values, each using a distinct set of assumptions, algorithms, and performance metrics.Choosing a relevant imputation method can significantly influence the subsequent analysis and the reliability of the results.To give a thorough comparative analysis of various missing data imputation methods for time series air quality of data [5], we compare several conventional but often used imputation techniques (mean, median, kNNI, and MICE) with several recently developed imputation techniques for temporal data (SAITS, BRITS, MRNN, and Transformer) to examine their impact on air quality data from various places.On the other hand, the rate of missing data can also impact the problem-solving strategy we use since missing values can be handled in the step of data preprocessing.Various works have conducted experiments under a variety of missing rates.For example, [6] conducts experiments with missing rates from 5% to 50%.Nevertheless, in some other papers, the missing values rate could range from 1% to 80% of data [7][8][9], or start from 10% to 50% [10].
While some work [11] has been done to compare the performance of classical and newly developed time series imputation techniques such as BRITS [12], SAITS [13] for health care data, such practical comparisons for air quality has not been conducted yet.In addition, while there have been several works that examine the effects of imputation on air quality [14][15][16], most of them do not cover state-of-the-art imputation methods for time series that have been developed in recent years.In addition, up to our knowledge, while there have been some surveys on air quality prediction [17,18] or missing data imputation for air quality data [19,20], there has not been any work that reviews both problems and systematically compares state-ofthe-art imputation algorithms for air quality data.This motivates us to review recent studies related to the air quality prediction problem, along with missing values handling methods or techniques on time series data with a concentration on air quality data.In addition, we also empirically evaluate various time series imputation techniques, including classical and stateof-the-art methods for air quality data.In summary, the contribution of our work can be described as follows: 1. We review existing techniques for air quality prediction and missing data imputation.
2. We conduct experiments on various air quality datasets to compare the performance of various time series imputation methods using various measures.
3. We provide analysis and evaluation of the performance of techniques.
4. We provide practitioners practical guidance on how to deal with missing data in air quality data.
The structure of the paper can be organized as follows.Firstly, Section 1 gives an overview of the current research related to data and missing values and describes the research problem in Section 2. Afterward, we review the prediction methods for air quality data in Section 3. Besides, we also review related techniques imputing missing values in time series data from conventional to modern data in Section 4. Next, in Section 5, we present methods for imputing the missing values in this paper.Experiments compare and evaluate the results and imputation time of the methods and the accuracy of prediction models on air pollutant values and AQI levels on the different datasets in Section 6.Then, we discuss the related problems to impute missing values in Section 7. Finally, the paper ends with our conclusion and future works in Section 8.

Problem formulation
Most urban areas worldwide, including Vietnam, are facing increasing air pollution.Among them, the problem of air pollution due to dust is still the most prominent.In some large cities like Hanoi, the number of days with PM10 and PM2.5 dust pollution levels exceeding the limits is relatively high.The problem is how to reduce the impact of air pollution on human health.Therefore, to solve the above problem, experts believe that if air pollution is informed early in the form of prediction, it can help people proactively plan their lives, especially on days when air pollution is high, minimizing the effects of air pollution on health.Thereby, people will know and choose how to protect their health and that of their family members.Many countries predict air quality from three to five days in advance based on air and meteorological data (such as temperature, humidity, wind direction, and topography) from air monitoring stations.However, in implementing the problem of collecting data through sensors, the possibility of data loss of information occurs very often and unavoidably.Through this paper, we also present ways to handle missing data and how it will affect the problem of air pollution prediction or similar time series problems.
Missing data can exist in various ways, for example, at individual points or over intervals, where one sensor loses data for a period of time.In this section, we introduce preliminary definitions and formalize the problem of air quality imputation.Air quality data is generally collected from a set of sensors over different periods.We focus on the time series data with missing values.Some notations are defined to describe this problem.
For the rest of the paper, we denote a multivariate time series with D variables of time length T as X ¼ fx 1 ; x 2 ; . . .; x t ; . . .; x T g T 2 R T�D , where for each t 2 {1, 2, . .., T}, x t ¼ fx denotes the measurement of d-th variable of x t , where d = 1, . .., D. Let s t 2 R denote the timestamp when the t-th observation is obtained, and we assume that the first observation is made at time-stamp 0. A time series X can have missing values.We introduce a masking vector m t 2 {0, 1} D to denote which variables are missing at time step t; the mask token is set to 0 for missing structured data, and the others are set to 1.To represent the missing variables in X, the missing mask vector M 2 R T�D is introduced, where In addition, an indicating mask vector I is introduced to differentiate originally missing values and artificially missing values.
Moreover, the missing ratio for the dataset is defined as follows: p ¼

Number of the missing values
Total number of values : ð3Þ For example, we assume that X is the input time series matrix with three variables, M is the masking matrix for X, and T is the time stamp for X.In the following, we denote x d t ¼ NaN if the value is missing.For example, assume we have nine observations (T = 9), (D = 3), then these matrix are, respectively, If this matrix contains any missing values NaN, we estimate these values using imputation techniques.We create the masking matrix M, each with NaN value, which is replaced by 0 and the others by 1.
The performance of each imputation model is computed by considering the indicating mask.The missing values in the matrix X will be imputed using traditional imputation techniques (i.e., Mean, Median, MICE, kNNI) and recently developed imputation techniques (i.e., SAITS, BRITS, MRNN, Transformer).In what follows, we will review the current approaches in detail.

Air quality prediction: Existing techniques
In the current studies, there is a wealth of research on air quality prediction due to its importance in informing about the pollution level that will allow policy-makers to adopt measures for reducing its impact [21,22].Methods for air quality prediction can be classified into statistical, machine learning [23,24], and deep learning approaches.

Vector Auto-Regression (VAR).
One of the most popular statistical models for forecasting multivariate time series is the Vector Auto-Regression (VAR).It is considered an extension of the univariate autoregressive model.The findings of [25] have revealed that the VAR model is particularly valuable in capturing the dynamic characteristics of economic and financial time series, making it a powerful tool for describing their behavior and making forecasts.In [26], VAR was used to forecast daily concentrations of air pollutants (i.e., CO, NO 2 , and SO 3 ) in Tehran city for the next 24h.For such a task, the authors have considered the correlations between air pollutants to get more accurate forecasts.Experimental results have indicated the high efficiency of the proposed method.

Autoregressive Integrated Moving (ARIMA).
Aside from VAR, ARIMA algorithms [27] were applied to forecast air quality.In [28], authors proposed a hybrid method named ARIMAX by combining the advantage of ARIMA and numerical modeling to forecast real-time air pollutants in Hong Kong (i.e., PM2.5, O 3 , and NO 2 ).By employing experimental analysis, the proposed method significantly improves the quality of forecast results in multiple evaluation metrics.Similarly, the findings in [29] have shown the prominent role of ARIMA in forecasting PM10 in Dakar, Senegal.Accordingly, the proposed method combines system observations with multi-agent real-time simulation and evaluates with several simulations.

Traditional machine-learning methods
Some traditional machine learning algorithms used for air quality prediction can be Support Vector Regression (SVR), Random Forest (RF), and Linear Regression (LR).

Support Vector Regression (SVR)
. SVR models were used to forecast PM2.5 and PM10 in London [30].In that paper, the experimental results indicate the SVR's efficiency in forecasting air quality parameters (i.e., PM2.5 and PM10).A nonlinear dynamic model based on the SVR technique was proposed to forecast AQI in Oviedo, Spain [31].Accordingly, the proposed model first analyzed the relationship between primary and secondary pollutants.Then, it derived vital factors influencing the air quality and recommended potential enhancements for health and lifestyle.Zhu et al. [32] investigated an application of the SVR algorithm with a quasi-linear kernel for air quality prediction.For such a task, the paper designed a gated linear network to construct the multiple piecewise linear model, and it could be developed through the pre-training of a Winner-Take-All (WTA) autoencoder.This approach could outperform other state-of-the-art methods in the case of complex air quality prediction problems.It is due to the WTA strategy reducing the risk of overfitting and choosing appropriate sparsity parameters.

Random Forest (RF).
Regarding RF [33,34] proposed a parallel approach combined with Spark to forecast PM2.5 in Beijing.The experimental results revealed the efficiency and scalability of the proposed method in the case of big data.Later, RF was used to select the most important features to improve the quality of real-time air quality prediction [35].Concretely, the proposed method provides highly accurate predictions of three air pollutants (i.e., PM2.5, NO 2 , SO 2 ) and outperforms other state-of-the-art methods.

Linear regression.
Linear regression is also a state-of-the-art model for air quality prediction.Indeed, many linear regression models have been proposed to predict AQI levels in New Delhi [36]; In Catalonia [37], authors combined factors including the effect of the surface reflectance capacity of urban surfaces with solar radiation and elevation to predict AQI level in Catalonia.The dataset is collected from 75 different air quality monitoring stations.A clustering technique was applied to cluster these stations based on their similarity.Meanwhile, Multiple Linear Regression (MLR) was used to replicate the annual mean values of AQI in Catalonia.Experimental results illustrated that the proposed model provided highly accurate predictions of AQI.Djuric et al. [38] proposed a multiple linear regression to forecast air pollution indices (i.e., SO 2 , NO 2 , PM10, O 3 , and CO) in Belgrade, Serbia.They collected the training and testing sets from the winters of 2011 and 2012/2013, respectively.In addition, the findings show that the proposed model can be scaled up to forecast long-term air quality.

Deep learning techniques
3.3.1Long short-term memory.Apart from the traditional machine learning approaches, most deep learning methods, such as Long short-term memory (LSTM), have shown their superiority over many machine learning techniques.Even though in [39], the LSTM model has outperformed MLP and RNN models in predicting PM10 and SO 2 in the Basaksehir district of Istanbul province.In [40], authors have proposed a bidirectional LSTM (Bi-LSTM) model by considering both past and future information to forecast PM2.5 of three cities in Korea and five cities in China.Accordingly, its performance is superior to GRU and LSTM in terms of the air quality forecast for these cities.Concretely, with short-term prediction, these models have similar performances.Meanwhile, with long-term prediction, Bi-LSTM outperformed GRU and LSTM.Wang and colleagues [41] developed a combination of the CT (chisquare test) with the LSTM model to analyze the relationship between air pollution variables.The paper identified the factors influencing air quality by using CT for such a task.Then, the AQI level was predicted by the LSTM model using a dataset collected at Shijiazhuang in the Hebei Province of China.In comparison to other competitive methods (i.e., SVR, MLP, BP neural network, Simple RNN), the proposed method provides an accuracy of 93.7% (the highest one).In addition, the proposed method outperforms the baseline methods in terms of MAE, MSE, and RMSE metrics.In [42], a GRU layer has been added to the LSTM structure to improve the accuracy of the air quality prediction problem.Experimental results with a dataset collected in Delhi show the outperformance of the proposed approach compared to other competitive methods of linear regression, GRU, kNN, and SVM in terms of MAE and R 2 .

Recurrent Neural Networks (RNN).
In [43], the authors have proposed an RNN algorithm to predict PM2.5 in Japan by employing a dynamic method to pre-train the model based on multi-step-ahead time series prediction.[44] apply RNN to predict PM10, O 3 , SO 2 , CO and NO 2 .The dataset is collected from different sensors with intervals of 1 hour.In addition, the authors applied fine-tuning to find the best hyperparameters of neural network structure and optimization function.Moreover, the investigated model can be applied to predict similar pollutants in other neighboring areas.

Gated Recurrent Unit (GRU).
In the current literature, many research results indicate that existing models are best at short-term forecasts.Meanwhile, improving existing approaches to forecasting long-term air quality is necessary.[45] proposed an algorithm that is considered an enhanced version of GRU (named BiAGRU) by combining bidirectional gated recurrent unit integrated with an attention mechanism.By means of experimental analysis, the proposed model is superior to many traditional machine learning models and modern deep learning models.
Referring to [46], a model based on Gated Recurrent Units (GRUs) has been proposed to forecast NO 2 pollutant concentration.The proposed model is assessed and fine-tuned for such a task concerning the number of features, look-backs, neurons, and epochs.Also, in Beijing [47], authors introduced a model based on spatiotemporal CRUs combined with a Geographic Self-Organizing Map (GeoSOM).Concretely, all monitor stations were clustered using timeseries features and geographical coordinates.Later, GRU models were proposed for clusters, and Gaussian vector weights were used to weigh different models in predicting the target sequence.Experimental results showed the technique's efficiency compared to several state-ofthe-art ones regarding MAE, MRE, and R2 metrics.
Since existing models do not fully consider the temporal dependencies, spatial correlations, and feature correlations hidden in a given dataset, in [48], authors examined these correlations by introducing a spatiotemporal deep learning model named Conv1D-LSTM based on 1-D convolutional neural network and LSTM for spatial and temporal correlation feature extraction.In addition, a fully connected network exploits these features for the air quality prediction problem.Furthermore, missing data have been imputed to enhance the quality of air quality prediction.The proposed method outperformed other well-known baseline methods through experimental analysis.

Data fusion
Besides techniques for the air quality prediction problems as mentioned above, there exist further works solving the problem by using data fusion.In this section, we will provide a brief discussion of these approaches.

Multimodal data.
Air pollution is one of the most worrying issues facing the world today.So, forecasting of particulate matter (PM) is necessary nowadays.Ton et al. [49] pointed out that combining meteorological features and timestamp information in Hanoi air quality datasets improved the results of PM10 and PM2.5 forecasting.The authors extracted two new features, which were weekend and working hour, from the "Date Time" recorded variable.Then, encoding the time into a vector of 0, 1 to include two new variables, weekend and working hour.First, with the variable weekend, the time vector from Monday to Friday was encoded as 0. On the other hand, during Saturday and Sunday weekends, the time vector was 1.Second, with working hour, the time vector was 1 in the range 7 AM to 7 PM, whereas this was 0. According to the authors, the time steps of the two new variables weekend and working hour were synchronized with other weather and air quality variables.It was highly efficient in 68% of the cases compared to other methods by conducting five deep learning models: MLP, 1D-CNN, LSTM, Bi-LSTM, and Stacked LSTM.Besides, in the long-term forecast of PM concentrations, the Vanilla LSTM model with combined features performed better than the other.
Similarly, to predict the PM2.5 air pollution level in the short-and medium-term, Tejima and colleagues [50] also proposed a framework that looks for hidden associations between traffic factors and air pollution.The six steps in their framework can be defined as follows: (1) Use any machine learning algorithm to extract features from the traffic images, (2) Create a new dataset by combining the extracted features dataset and air pollution dataset using time, (3) Use fuzzy rules to convert this new dataset into an uncertain temporal database, and (4) Use uncertain periodic-frequent pattern mining techniques to uncover hidden relationships between various traffic factors and air pollution, ( 5) estimate air pollution level from a given image using transfer learning on a pre-trained model, and ( 6) predict air pollution level using estimated air pollution level and mined patterns dataset.Experimental results show that their method can accurately estimate and predict air pollution levels, ranging from 77% to 98%.
3.4.2Neighbor stations.Currently, air pollution and urban life influence human health.Therefore, environmental and data science experts always try to find the most accurate way to predict and provide timely warnings to humans.Specifically, Dao et al. [51] use methods of data imputation for the UrbanAir dataset to predict air pollution at a place without a station by using neighbor stations and predict air pollution of Dalat and discover the correlation/association between air pollution and human activities.The authors divided the article into two tasks that need to be performed: Subtask 1 only used environmental data to predict air pollution and only used traffic data in Subtask 2. Subtask 2 accepts training a prediction model using environmental and traffic data, but only traffic data is used to predict air pollution.While Subtask 2 only accepts AQI levels, Subtask 1 requires predicting both the exact value and AQI level of each pollutant concentration.The paper encouraged researchers to develop a generic framework to discover a correlation among different traffic factors, weather, and air pollution in a locality.By using these correlations, the authors improved the accuracy of AQI prediction and understood the mutual impact between urban life and air pollution.
Besides, Nguyen et al. [52] also introduced a dataset containing data about personal life and the surrounding environment, collected periodically along predetermined routes in Ho Chi Minh City, Vietnam.They also introduced self-developed devices and system architectures for data collection, storage, access, and visualization.There were interesting research topics and applications, including understanding the correlation between human health, air pollution, and traffic congestion.

Images.
Human health is mostly impacted by air pollution.Over time, there has been an increase in the number of patients and disease reports related to air pollution.By using lifelog data and urban nature similarity, a method was introduced in [53,54] that could predict AQI at a local and individual scale with a few images taken from smartphones and open AQI and weather datasets.Various public datasets pertaining to weather, air pollution, and images are used to develop and evaluate image retrieval and prediction model techniques.The outcomes support their hypothesis regarding the strong correlation between the AQI and snapshots of the surrounding area.

Variable selection.
Currently, several statistical and machine learning methods are used to uncover useful information and patterns for enormous datasets.The common model selection (variable selection) methods include Neural Networks (NN) and RF.The statistical methods like the Least Absolute Shrinkage And Selection Operator (LASSO) [55] and principal component analysis (PCA).The authors [56] have proposed combining NN with LASSO or RF for even better results.In addition, they tested these new methods along with classical techniques (ordinary least square and feed-forward NN) using Monte Carlo simulation and real-world air quality data from Italy.The study found that the combined methods achieved lower errors, suggesting they outperform the traditional approaches.
Many methods have been proposed to improve the performance of air quality prediction.However, most investigated methods are based on complete datasets.Therefore, we need to impute missing values to reinforce the prediction models' performance.

Data imputation: Recent techniques
Various statistical and machine learning methods [57][58][59] have been developed to overcome the problem of missing data for time series, to fill in the missing values in the data, or in other words, imputing the missing values.However, methods have limitations in handling data with high missing rates or changes in available variables.In addition, the performances of these methods vary widely according to the type of data, noise levels, or other factors and show a high dependence on correlations within the data.
In this section, we want to provide an overview of the relationships among the given imputation techniques and comparisons and then discuss them individually.

Ignoring.
Ignoring [60,61] is a method that completely ignores missing values when conducting the analysis process.Although this is a simple method, if the rate of missing data is high enough to influence the analysis outcomes, it is highly dangerous.
4.1.2Deletion.An approach of removing/deleting missing observations from raw data is called Deletion [62,63].It is also a frequently used method when the missing values of the data are not high, and removing missing values will not affect the analysis results.Nevertheless, when the missing data rate is high, deleting missing values makes the data incomplete and unsuitable for some other analysis applications.
4.1.3Mean/Median/Mode imputation.Mean/Median/Mode are simple methods.There, the missing value for a continuous variable is imputed by the mean/median of the observed values.When the missing values for a categorical variable are replaced by the Mode of the observed values, these approaches are quick to compute and simple to implement.Mean/ Median/Mode Imputation methods [64] are a solution for better analysis results when they solve the issue of handling missing data values, whereas Ignoring and Deletion methods are thought to provide poor results in the analysis or data mining process when the missing data rate is high.Furthermore, the limitation of these methods is that the bias created by multiple values on the data has the same value, even if the data are MCAR.As a result, it may bias the estimation of skewed distributions.

Regression imputation.
There are two steps in Regression imputation [65,66].The first is to estimate a linear regression model using the target variable's observed values along with the explanatory variables.After that, one can use the model to predict values for the missing cases in the target variable.Missing values of the variable are replaced based on these predictions.There are two types.First is deterministic regression imputation.It means missing values are replaced with the exact prediction of the regression model.The second is stochastic regression imputation, which adds an additional random error term to the predicted value imputed by deterministic regression imputation.Regression imputation is the improvement over Mean/Median/Mode imputation.Besides, it has disadvantages, including the assumptions of error distribution and linear relationship, which are relatively strict and give poor results for heteroscedastic data.

Last Observation Carried Forward (LOCF).
Last Observation Carried Forward [67,68] fills in missing values by using the last observed value of the given features in each sample; if there is no previous observation, 0 will be filled in.LOCF assumes that the missing data is constant or follows a gradual change.However, if the missing values are not stationary or the sensor readings exhibit abrupt changes, this method may introduce bias and inaccuracies.

Multivariate Imputation by Chained Equations (MICE).
Multiple imputation offers numerous benefits compared to the single imputation methods mentioned above.MICE [69,70] is one of the most popular multiple imputation techniques.The process uses an iterative set of regression models to impute missing data from a dataset.It imputes missing values in the dataset's variables by focusing on one variable at a time.Once the focus is placed on one variable, MICE uses all the other variables in the dataset to predict missingness in that variable.The prediction is based on a regression model, with the form of the model depending on the nature of the focus variable.MICE methods perform better and are more reliable for data with a limited sample size.
On the other hand, MICE has several benefits, such as results in unbiased estimates, being easily interpreted in a Bayesian context, and having a large number of workable algorithms built into the MICE framework.It is worth noting that MICE is especially helpful when missing values are associated with the target variable in a way that causes leakage.Users can also state what they believe to be the likely distribution of the missing value using MICE.However, MICE comes at a high computational cost.
4.1.7First five last three logistic regression imputation (FTLRI).Chen et al. [71] proposed an interesting approach for data imputation, namely FTLRI, for time-series air quality data.The paper is based on the traditional logistic regression and a presented "first Five & last Three" model.These techniques could explain relationships among disparate attributes and then derive highly relevant data, for both time and attributes, to the missing data, respectively.The results showed that FTLRI has a significant advantage over the compared imputation approaches, particularly in short-term and long-term time-series air quality data.Furthermore, FTLRI can perform better on datasets with relatively high missing rates (about 40%) since it only selects highly relevant data to the missing values instead of relying on all other data like other methods.

Autoregressive Distributed Lag (ARDL).
Selecting criteria is considered an important issue in the Autoregressive Distributed Lag (ARDL) model.El et al. [72] proposed the use of four imputation methods (k-Nearest Neighbors, Expectation-Maximization, Classification, and Regression Tree, and Random Forest) for handling the missing values.Their goal was to improve the accuracy of the model with the optimal order of lags.They compared these methods using real economic data related to foreign direct investment (FDI) in Libya.Their findings suggest that the Expectation-Maximization method performed best compared to the others.
4.1.9EPK.Next, Mohamed et al. [73] introduced a new imputation technique called EPK.Using the Monte Carlo simulation, they evaluated the effectiveness of nine different imputation methods, including EPK.The simulations focused on a specific type of statistical model (binary logistic regression) when the missingness mechanism is MAR.Additionally, they tested the methods on real data from social network advertising.The results from both simulations and real-world applications showed that EPK outperformed other imputation methods regardless of where the missing data occurred (independent variables only, dependent variable only, or both).

Machine-learning approaches
The recent methods for imputing missing data in time series led to more accurate and improved imputed data than traditional approaches.Choosing an appropriate imputation method for a specific type of missing data significantly impacts the performance of data imputation.

k-Nearest Neighbor Imputation (kNNI)
. kNNI method [14,69] uses the k-nearest neighbor to identify similar samples with normalized Euclidean distances or some other type of distance and impute the missing values with the average value of its neighbors.The k-nearest neighbor method can impute continuous variables (by using the mean or weighted mean among the k-nearest neighbors) and categorical variables (by using the Mode among the k-nearest neighbors).Both quantitative and qualitative features are handled by kNNI with ease.However, it performs computationally intensively for large data since it searches through all the datasets and requires the specification of hyper-parameters that can greatly affect the results.
4.2.2MissForest.The Random Forest (RF) algorithm can also applied for multivariate time series data, employing an average of the corresponding full values.Using proximity data points, this algorithm then iteratively improves the imputation of missing data.Generally, mis-sForest is a technique that was proposed by [74] based on Random Forests.The article showed that RF intrinsically constitutes a multiple imputation scheme by averaging many unpruned classification or regression trees.The imputation error can be estimated without a test set using Random Forest's built-in out-of-bag error estimates.Furthermore, missForest performs better than K-nearest neighbors and other imputation techniques, giving outstanding results for data containing non-linear relations and/or complex interactions.Additionally, it works well with data containing both qualitative and quantitative features.When using missForest, there is no need to tune parameters, do categorical encoding, or standardize the data.MissForest can be utilized to achieve good imputation results even in high-dimensional datasets with a large number of variables compared to the sample size.

Deep neural networks
In addition, many deep learning techniques have been developed to solve imputation for missing values in time series data.

GRU-D.
Chen et al. [75] proposed the GRU-D model, which is a deep learning model based on Gated Recurrent Unit (GRU) that takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it does not only captures the long-term temporal dependencies in time series but also utilizes the missing patterns to achieve better prediction results.

Deep auto-encoder.
One method that can be used for data imputation is the autoencoder structure.It extracts features from low-dimensional layers using the encoder and decoder structure, and the decoder recovers missing values.As such, it can function as a methodological feature.[76] presented one technique using deep autoencoders for spatiotemporal challenges involving imputing missing data.The proposed method for capturing temporal and spatial patterns was a convolution bidirectional LSTM.Additionally, the authors analyzed an autoencoder's latent feature representation in spatiotemporal data and illustrated its performance for missing data imputation.The experimental result illustrated that the convolution recurrent neural network outperforms state-of-the-art methods.

MultiLayer Perceptron (MLP).
Next, [77] estimated the missing values of a variable in multivariate time series data using a MultiLayer Perceptron.To achieve the best prediction performance for the specified time series, an automated technique was employed to identify the optimal MLP model architecture, filling in a long continuous gap instead of relying on isolated, randomly missing observations.The findings demonstrated that using MLP to fill a big gap produces better outcomes, especially when the data behaves nonlinearly.

Raindrop.
Raindrop [78] is a Graph Neural Network-based algorithm embedding irregularly sampled and multivariate time series.It is inspired by how raindrops hit a surface at varying time intervals and create ripple effects propagating throughout the surface.Raindrop helps handle missing data with irregular time series.It represents every sample as a separate sensor graph and models time-varying dependencies between sensors with a novel message-passing operator.It estimates the latent sensor graph structure and leverages the structure together with nearby observations to predict misaligned readouts.This model can be interpreted as a graph neural network that sends messages over graphs that are optimized for capturing time-varying dependencies among sensors.Another typical work comes from Festag et al. [79], where the authors developed a system based on Generative Adversarial Networks that consist of recurrent encoders and decoders with attention mechanisms and can learn the distribution of intervals from multivariate time series conditioned on the periods before and, if available, periods after the values that are to be predicted.
Therefore, it is worthwhile to understand the data types with missing values and propose an effective and robust strategy to fill time-series air quality data with missing values.In experiments, to obtain a thorough comparison, we compare some classical by widening traditional imputation methods with some recently developed imputation methods:

Data imputation for air quality prediction
• Mean Imputation [80]: The missing values are replaced with the mean value of the corresponding features.
• Median Imputation [81]: It is similar to Mean imputation, but the median is utilized instead of the mean.
• Multivariate Imputation by Chained Equations (MICE) [69]: MICE imputes missing values in the variables of the dataset by focusing on one variable at a time.Once the focus is placed on one variable, MICE uses all the other variables in the dataset to predict missingness in that variable.The prediction is based on a regression model, with the form of the model depending on the nature of the focus variable.
• k-Nearest Neighbor Imputation (kNNI) [69]: It uses the k-nearest neighbor method to identify similar samples and impute the missing values with the average value of its neighbors.The k-nearest neighbor method can impute continuous variables (the mean or weighted mean among the k-nearest neighbors) and categorical variables (the mode among the k-nearest neighbors).
• Self-attention-based imputation for time series (SAITS) [13]: a self-attention mechanism for missing value imputation in multivariate time series.Typically, it is trained by a jointoptimization approach.SAITS learns missing values from a weighted combination of two diagonally masked self-attention (DMSA) blocks.DMSA explicitly captures both the temporal dependencies and feature correlations between time steps, which improves imputation accuracy and training speed.Meanwhile, the weighted-combination design enables SAITS to dynamically assign weights to the learned representations from two DMSA blocks according to the attention map and the missingness information.
• Bidirectional Recurrent Imputation for Time Series (BRITS) [12]: a method for filling the missing values for multiple correlated time series.It learns the missing values in a bidirectional recurrent dynamical system without any specific assumption.The imputed values are treated as variables of the RNN graph and can be effectively updated during the backpropagation.
• Multi-directional Recurrent Neural Network (MRNN) [82]: is a neural network architecture including two blocks (interpolation and imputation) trained simultaneously.The interpolation process operates within data streams, while the imputation process operates across data streams.The interpolater uses a Bi-directional Recurrent Neural Network (Bi-RNN) to interpolate missing values within each channel along the time dimension.Afterward, using a simple, fully connected neural network, the imputer can compute an estimate for each time step along all channels.
• Transformer [83,84]: Transformer is a self-attention-based model.It uses transformer architecture in an unsupervised manner to perform missing value imputation.Unlike the existing transformer architectures, this model only uses the encoder part of the transformer due to computational benefits.It is a joint-optimization training approach of imputation and reconstruction for self-attention models to perform missing value imputation for multivariate time series.
In the following sections, we will compare different data imputation techniques with various datasets related to air quality prediction.

Experimental setup
The efficacy of the missing data imputation methods depends heavily on the problem domain, for example, sample size, types of variables, and missingness mechanisms.
The descriptions of the six datasets used in this work and their preprocessing details are elaborated on below: The first dataset is a time-series air quality dataset with categorical contextual information (time and weather); the air pollution PM2.5 values were collected from sense-boxes installed in Frankfurt, Germany.The dataset has been read from 14 different sensors in close spatial proximity.The dataset was efficiently labeled and can be used as a gold-standard dataset for unsupervised problems.Similarly, the test set of this data takes data from 20% original dataset, 20% of the remaining 80% of the original dataset is used for validation, and the remaining is for training.We also chose every 30 minutes of data and every 1-hour consecutive step to generate time series data samples.
The second dataset is Beijing Multi-Site Air-Quality.It includes hourly air pollutant data from 12 monitoring sites in Beijing.This dataset collected data from 01/03/2013 to 28/02/2017 (48 months in total).For each monitoring site, 12 continuous time series variables are measured (e.g., PM2.5, PM10, SO2).The test set of the third dataset takes data from 20% original dataset, 20% of the remaining 80% of the original dataset is used for validation, and the remaining is for training.The validation set contains data from the following 05/11/2016.The training set takes from 18/12/2015.In addition, we take every one-hour data to generate a time series of data samples for every 24 consecutive steps.
The third dataset is from the Environmental Protection Administration, Executive Yuan, R. O.C. (Taiwan).It was only collected in Northern Taiwan in 2015, containing air quality and meteorological monitoring data.Besides, this data included 25 air pollution stations and 21 features.The test set takes data from 20% original dataset, 20% of the remaining 80% of the original dataset is used for validation, and the remaining is for training.Specifically, the training set takes place on 15/01/2015, the validation set takes place on 01/09/2015, and the remaining part is used as a test set.We selected every 1-hour data and every 12 consecutive steps in experiments to generate time series data samples.
The fourth dataset is Dalat Air Quality.Urban Air provides a streaming dataset from CCTV and air station networks installed in Dalat City, Vietnam.The system runs 24 × 7 and has several real problems, such as sudden camera/station turn-on/switch-off, noise, and outliers.There are ten air pollution stations (i.e., sensors 01-10), three attached to weather stations (i.e., sensor01, sensor02, sensor03), and fourteen CCTV cameras.Furthermore, the test set of the dataset takes data from 20% original dataset, 20% of the remaining 80% of the original dataset is used for validation, and the remaining is for training.We generate time series data samples by selecting every 1-hour data and every 24 consecutive steps.
The two final datasets were collected hourly at two monitoring stations in Hanoi: Cau Giay district and Minh Khai district.For example, Cau Giay dataset with observation time from 25/ 2/2019 to 25/11/2020, and Minh Khai dataset from 01/1/2019 to 25/11/2020 record measured features including PM10, PM2.5, SO 2 , O 3 , NO 2 , NO, NO x , CO, Temperature, Humidity, Wind speed, Rain, Wind direction, Atmospheric Pressure, Solar Radiation.Besides, 20% of the original dataset is utilized for the test set, 20% of the remaining 80% is used for validation, and the remaining is used for training.Then, we choose one hour's worth of data per 24 consecutive steps to create time series data samples.
After the preprocessing step, general information about the datasets is described in Table 1.
We generate artificial missingness to evaluate all imputation methods used.It is important to note that normalization is applied in the preprocessing of all datasets.For each dataset, the missing ratio p is varied from 10% to 80% with increments of 10% for each dataset to evaluate the models at different missing ratios.For p 2 {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%}, we train the model to fill the missing values and then calculate the imputation accuracy.However, with the high missing rates in the original Dalat dataset (greater than 40%) for this dataset, we generate extra artificial missing values with missing rates of 10%-30% only.
Besides, the specific information about the architectures of SAITS, BRITS, MRNN, and Transformer models in this paper can be described in Table 2.
After the model imputes all missing values, two metrics are used to measure the imputation methods' performance: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) RMSE ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi  For multiple imputation cases with K imputations, we have K values for MAE and RMSE per dataset, and we use the average to evaluate the model performance.We designed each experiment 10 times.We report mean MAE and RMSE, along with their running time, as the performance metrics.
In this paper, all the models were trained/tested on a computer with the following configurations: Intel(R) Xeon,(R) Gold 6254 CPU @3.10GHz/512GB RAM.
Detailed experimental results and the running time of imputation methods on six datasets are recorded in Tables 3-8 and described in Figs 2-7. 3 presents the results of the traditional imputation method compared with other imputation methods regarding the experiment accuracy and running time on the Frankfurt air quality dataset.One can see that when the missing rate of the dataset increases from 10% to 80%, Mean and Median methods have an almost constant MAE error (fluctuating around 0.787-0.788);the MRNN model gives the highest error (greater than 0.904).Moreover, the Transformer model has an MAE error from 0.651 to 0.77; the kNNI model alone has the smallest MAE and RMSE errors among the remaining machine learning models, such as MICE, even lower than the currently used neural-network models, such as SAITS and BRITS.In general, in this dataset, the MAE and RMSE errors of the kNNI model both give the lowest and most stable results among the remaining missing data models when the missing rate of the original data is 0%, and the artificial missing data rate gradually increases from 10% to 80%.As depicted in Fig 2c, it is worth noting that the running time of the models used in this dataset mainly increases when the missing rate of the dataset changes from 10% to 80%.On the other hand, compared to traditional models or basic machine learning models, although models based on neural networks have a long calculation time, MRNN gives relatively positive results and is the most effective among the models using neural networks in terms of Mean, Median, and MICE.On the other hand, when the original dataset is not missing and the data size is larger than one million records (for example, Frankfurt air quality data), kNNI is considered the model that gives the best results with an artificial missing rate of 10%-80% and time execution time gradually increases from 62.684 × 10 3 milliseconds to 941.832 × 10 3 milliseconds, followed by SAITS with computation time decreasing from 843.803 × 10 3 milliseconds to 391.645 × 10 3 milliseconds.

Results on different datasets 6.2.1 Frankfurt dataset. Table
6.2.2 Beijing dataset.Table 4 depicts the results of imputation methods compared with other imputation methods on the Beijing air quality dataset.In this dataset, the SAITS model also gives the lowest MAE measure compared to other machine learning models; the model error varies from 0.142 to 0.349, followed by BRITS.Meanwhile, the traditional data-filling models vary from 0.724 to 0.885 as the missing ratio gradually increases from 10% to 80%.On the other hand, the MAE error of the SAITS model when the artificial missing rate of data changes from 10% to 40% is lower than that of the MICE model; on the contrary, the RMSE  missing rate increases from 10% to 80%.Next, kNNI is a model with large fluctuations in calculation time, gradually increasing as the missing rate increases.Moreover, the calculation time of the Transformer gradually decreases and changes sharply as the missing ratio increases.Compared to traditional machine learning models, MRNN has the most stable and fastest calculation time compared to the remaining models in this dataset.

Taiwan dataset.
Table 5 shows that the MAE error between traditional models and current models using neural networks grows larger as the missing rate of input temporal data increases on the Northern Taiwan air quality dataset.Specifically, SAITS is the model with the lowest error (ranging from 0.121 to 0.284), followed by BRITS (error only from 0.136 to 0.29) in this data.Meanwhile, methods such as Mean, Median, kNNI, or MICE give errors when filling in missing values that deviate greatly from the original value.Besides, when the artificial missing rate of the data changes from 10% to 60%, the MAE and RMSE errors of the SAITS model give the lowest results.However, when the missing data rate reaches 70%, the SAITS model's MAE error is the smallest, but the RMSE error is higher than that of the kNNI model.When the missing rate is from 80%, the MAE and RMSE error results of kNNI are the lowest among the models, followed by BRITS.
On the one hand, in Fig 4c, we can see the computational time of machine learning models like kNNI is the smallest after traditional missing data filling models like Mean and Median.Besides, MRNN is the model with the least computational time among machine learning models, followed by MICE.The remaining models have fluctuating and irregular calculation times,  [51].In this table, traditional methods such as Mean and Median give a constant MAE measure (about 0.83-0.86)when the missing rate of data changes from 10% to 30% and almost the result of these measures is the largest compared to the remaining missing data filling models.Meanwhile, the neural network models used in this dataset, such as SAITS and BRITS, give optimal results, which are not much different from traditional machine learning models such as kNNI and MICE (for MAE measurement, the shortest).However, when the artificial missingness ratio of the data is at 10%, MICE gives relatively low MAE and RMSE errors among the models.Furthermore, when increasing the artificial missing rate to 20%, although the MAE error of MICE is the lowest, the RMSE error of the BRITS model is the smallest.Next,    [49] are presented in Table 7.One can see that the MAE errors of MICE showed the best results among the used models (only from 0.241-0.424)when the artificial missing rate of the data gradually increased from 10% to 80%.The second best is SAITS (from 0.289-0.578),and the third one is BRITS (from 0.337-0.666).However, the RMSE error of SAITS is the lowest with artificial missing rates of 10%-30% and 50%-60%.Meanwhile, when the missing rate is 40% and increases to 70%-80%, MICE almost always gives relatively good results compared to the remaining models.Besides, the MAE and RMSE errors of the models become larger when the missing rate changes from 10% to 80%, especially the MAE and RMSE errors of the two methods Mean and Median are large, only fluctuating around 0.8.In addition, the running time of machine learning and neural network models is significantly slower than Mean imputation (0.379-0.509 milliseconds) and Median imputation (0.933-0.582 milliseconds).Also, the slowest is MICE, with a relatively large running time, ranging from 71.192 × 10 3 to 71.527 × 10 3 milliseconds.
6.2.6 Minh Khai dataset.Table 8 presents experimental results on the Minh Khai air quality dataset [49].Although the experimental time of MICE is the largest, the model gives the most optimal MAE and RMSE error results among the models used, followed by SAITS, BRITS, kNNI, and Transformer.The MAE and RMSE errors of the Mean and Median methods hardly change much when increasing the missing data rate from 10% to 80%; MAE and RMSE errors of the remaining models gradually increase as the missing data rate increases.
Furthermore, one can also see that when the original missing rate of data is less than 5%, MICE is the method that gives the most optimal results among the missing data imputation methods used in this article (specifically with the Cau Giay district and Minh Khai district air quality dataset).Besides, the running time of MICE is high and increases the fastest when the missing rate of data gradually increases with the Minh Khai dataset.On the other hand, the Cau Giay dataset does not change much over time.However, when considering the performance of filling in missing values using neural networks, SAITS is the model with the most optimal performance, followed by BRITS.We knew that Transformer is a deep-learning model designed to solve many problems.However, in this study, we can see that Transformer hardly promotes its strengths, and experimental results on different data all give much larger MAE and RMSE errors than SAITS and BRITS.

The impact of different numbers of layers
Based on the experimental results presented above, although MICE gives a better MAE measure than SAITS in some cases (specifically in datasets in Vietnam), the running time of MICE is many times longer than that of SAITS.Therefore, we propose SAITS as the model to fill in missing values for the air quality data for the multivariate time series of those tested in this paper.We now perform another test and compare the results when performing missing data filling on air quality data of SAITS with the Transformer model with the different number of layers cases (i.e., two layers, four layers, and six layers) and the BRITS model.We then propose the best model to fill in the last missing value before predicting air quality for the following year.
The result of evaluating Transformer and SAITS with two-layer, four-layer, and six-layer for both datasets is presented accordingly in Fig 8 .From there, one can see that SAITS with two, four, and six layers do not show as clearly as Transformer with two, four, and six layers in the six datasets, including Frankfurt, Beijing, and Taiwan.However, for the Dalat, Cau Giay, and Minh Khai datasets, we can see the results of both SAITS and Transformer with two, four, and six layers clearly shown.As a result, SAITS with two layers performs better than Transformers with two, four, and six layers.Besides, the RMSE performance of SAITS and Transformer with layers of all datasets in Fig 9 is unclear and changes frequently.In contrast, the running time of SAITS and Transformer with two layers in Fig 10 for both datasets is also a more effective model with four layers and six layers.

Predict air quality in Vietnam combined AQI indexes
Based on the results and evaluating the performance of the methods with different numbers of layers above, we propose the SAITS model with two layers as the most optimal missing value estimation technique with a small sample size on the three air quality datasets in Vietnam.
We started predicting factors affecting air pollution in the following 24 hours based on Vector Auto-Regression (VAR) and related features.Next, we analyzed daily air pollution levels across countries on three datasets in Vietnam, which were mentioned based on the Air Quality Index (AQI).The AQI level of each pollutant is calculated according to the instruction (Available on: https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistancedocument-sept2018.pdf) and divided into seven levels (i.e., Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, Hazardous, and Extreme Hazardous).It only includes six air pollutants (i.e., PM2.5, PM10, CO, NO 2 , SO 2 , and O 3 ) that are required to predict their values and related AQI levels.This index tells us how clean or polluted the air is and what it means to human health.The higher the AQI index value, the greater the level of air pollution and the more negative impact it has on human health.
Additionally, we also perform the outlier detection for the datasets using Mahalanobis distance, where points with large Mahalanobis distance (greater than 97.5%) are considered to be outliers.However, we didn't remove the outliers from the datasets because deleting them will make the data become irregularly sampled time series data, and VAR is not designed for that type of data.
We analyze indexes such as SO 2 ; CO; PM10; PM2.5; O 3 recorded by sensors in Dalat, Vietnam.The concentration of PM2.5 collected at some stations is large (ranging from 50-100μg/ m 3 ), but the air quality collected from stations in this place is at an acceptable average level, some stations exceeding the threshold range from 100-106μg/m 3 but at a poor level and affecting sensitive groups of people.(Similar to the concentration of fine particles PM10).Meanwhile, the levels of CO, SO 2 , and O 3 in the air are low, only from 1.09-10.7μg/m 3, completely at a good level and do not have much impact on human health.In general, from the end of 2019 to the beginning of 2020, the concentration of these indicators increased and posed serious harm to human health.Besides, some other factors can interfere with the fluctuation of air pollution in Dalat City, known as the city of thousands of flowers, and attract many people, especially young people, to visit and relax during festivals or on the weekend.Next, Dalat is geographically located in a mountain valley and has cool weather that may impact air pollution and some activities.
Next, we also analyzed factors such as PM10, PM2_5, CO, SO 2 , O 3 recorded by two districts of Cau Giay and Minh Khai in Hanoi, Vietnam.Fine dust concentrations PM10 and PM2.5 in the Cau Giay district are mostly at moderate levels according to the AQI categories table, and they do not cause serious effects on human health.However, on some days when the concentration of these indicators is recorded at a high level, greater than 100μg/m 3 , and even on some days, the concentration of fine dust in some time frames exceeds 300μg/m 3 , seriously affecting the health of the people of the Capital.In addition, the number of motorbikes and vehicles circulating in Hanoi is quite large, which is also the cause of air pollution here; CO concentration is greater than 1000μg/m 3 .For example, from 22/2/2023-23/2/2023, CO concentration was recorded above 10, 000μg/m 3 at an alarming level.In addition, the concentration of SO 2 is relatively low, possibly because this area does not have many manufacturing plants, so the amount of toxic chemicals such as SO 2 released into the environment is small, at a good level, and does not affect the human health.From the end of 03/2019 to mid-04/2019, the concentration of O 3 gradually increased, from 100-350μg/m 3 , changing from a level that is not harmful to human health to a seriously harmful level.Besides, the concentration of fine dust particles PM10, PM2.5 in the air recorded in the Minh Khai district dataset is also quite high, which is not good for human health.Some days, the fine dust concentration of these particles exceeds 479μg/m 3 and 255μg/m 3 , respectively PM10 and PM2.5.Furthermore, the amount of SO 2 and O 3 in the Minh Khai district dataset is quite low, at a normal level according to the AQI categories scale, so it does not affect human health.However, the concentration of CO in the air is very high; most days, the concentration is recorded to be greater than 1000μg/m 3 , which is considered "Exceeding AQI." Recommendations for hazard classification should be implemented.In summary, Hanoi is known to domestic and foreign friends as the Capital of Vietnam, a place to work and welcome heads of state.Based on the air pollution problem in the Cau Giay and Minh Khai districts, we need to take many measures to reduce air pollution and improve the environment to attract tourists, creating economic development and international cooperation conditions.

Discussion
There are many estimation techniques to handle missing data.A lot of them focus on the missing values of the time series.However, these techniques might not be able to capture time information and produce reliable imputation results if timestamps are missing.Therefore, expanding on current techniques to impute missing timestamps may be suitable for handling such situations.Research on deep learning modeling is a being cared area.Numerous novel deep-learning models with practical applications have been proposed in recent years.There is a growing number of research papers on deep learning for imputing missing data, and these new approaches seem promising.However, in practical applications, the validity and strength of these models must be carefully assessed.In many cases, reliable and reproducible code is frequently unavailable or incomplete.Compared to conventional statistical methods, the number of hyperparameters for deep learning models typically requires much more tuning.In some applications, the hyperparameter search on big data may be prohibitively expensive due to the required training time or memory size.However, our research showed that for data with small, moderate, or even large sample sizes (i.e., when the sample size is less than 30,000), the stability and convergence of the deep learning models needed to be revised.
Numerous factors, including the sample size of the data, the distribution of the variables, the number of missing values in the data, the correlation structure of the data, and potential missingness mechanisms, affect how effective different missing data imputation techniques are.
While this work presents a thorough investigation of time series imputation techniques and provides practical implications for practitioners, it also has some limitations.For example, different types of geographic locations, such as mountains and plains, may affect air quality.Collecting more datasets and examining the patterns for various types of geographic locations may help draw out common patterns and insights to improve the imputation quality.However, due to limited data available, we have not achieved that goal yet.In addition, this work so far has only concentrated on examining the effect of missing data for the missing at random pattern.However, it is also possible that air quality data is missing, not at random.These will be topics for our future research.

Conclusions
We have presented an investigation of the impact of data imputation techniques on the air quality prediction problem.In general, SAITS gives the lowest error, and the difference in running time is negligible compared to the missing value imputation efficiency that SAITS provides when the missing rate increases.Besides, BRITS is the model that gives the second-best error among deep learning models, only after SAITS.kNNI running time can increase significantly as the missing rate increases.However, this may not be true for other methods.Also, kNNI proves itself to remain a promising imputer for the dataset.In addition, for three datasets (Northern Taiwan, Beijing, and Frankfurt) which have a large sample size, at the high missing rate of 80%, kNNI outperforms other techniques, including the state-of-the-art such as SAITS, BRITS, and Transformer.Meanwhile, other cases were varied using limited sample size and ratios of missing data.The experiment results show the conventional method, MICE, outperforms the recently proposed deep learning methods, such as SAITS and BRITS, in these experiments.
The outcomes of this article can open a new direction for predicting air pollution.However, as discussed, it also has some limitations, such as the experiments concentrated only on missingness at random, and the paper has not been able to draw insights by grouping datasets based on geological characteristics due to limited data.Therefore, in the future, we will collect more datasets to examine the imputation quality based on types of geographic locations or some other characteristics and consider the imputation effects for nonrandomly missing data.Also, we will develop an ensemble technique combining the latest missing techniques and SAITS to enhance the imputation performance.Moreover, the process can integrate the knowledge in a particular domain to support extracting helpful information from datasets [86].In addition, in the future, we plan to empirically evaluate the performance of imputation techniques for other types of environmental data, such as imbalanced missing data.Last but not least, we want to investigate if methods to combine datasets such as ComImp [87] can be used to combine air quality datasets to improve the imputation and prediction quality.

A
flowchart for the setup of the training process for the Machine Learning and Deep Learning framework proposed in this work is shown in Fig 1.

Model n_layers d_model d_inner n_heads d_k d_v dropout attn_dropout Batchsize rnn_hidden_size
/doi.org/10.1371/journal.pone.0306303.t002where X d;real t and X d;imputed t are real and imputation values, respectively, and n denotes the size of dataset with missing values.

Table 4 . Experimental results of different methods in Beijing air quality dataset.
is lower than that of SAITS, and the lowest in the remaining used models such as BRITS, kNNI, Transformer, Mean, Median, and MRNN.The artificial missing rate of data ranges from 50% to 70%, the MAE and RMSE errors of the SAITS model are stable again, and the experimental results obtained are the smallest among the models.The remaining models give a relatively large error, with MRNN having the largest error.When the missing data rate is at 80%, the kNNI model gives the best MAE and RMSE errors compared to traditional data filling or machine learning models.In addition, Fig3cshows the computational time of models such as Median, MRNN, MICE, SAITS, and BRITS remains almost constant when the https://doi.org/10.1371/journal.pone.0306303.t004error of MICE

Table 7 . Experimental results of different methods on Cau Giay (Vietnam) air quality dataset.
By comparing the performance of the Mean, Median, kNNI, MICE, SAITS, BRITS, MRNN, and Transformer models, we see that SAITS is the best missing data imputation model on Northern Taiwan and Beijing air quality dataset with an artificial missing rate under 70%.One can see that when the original missing rate of the dataset is less than 30% (or 10%).The dataset only has a few hundred thousand records.SAITS seems to be the model with the lowest error, and model execution time also gradually decreased (from 4223.126 × 10 3 milliseconds to 1117.453 × 10 3 milliseconds with the Northern Taiwan dataset and from 403.223 × 10 3 milliseconds to 259.380 × 10 3 milliseconds with the Beijing dataset) as the missing rate of the dataset increased.6.2.4 Dalat dataset.Table 6 presents similar experimental results on the Dalat air quality dataset

Table 8 . Experimental results of different methods on Minh Khai District (Vietnam) air quality dataset.
when continuing to increase the artificial missing rate of the model to 30%, the experimental results, MICE is the model with the smallest MAE, and kNNI is the model with the lowest RMSE of all.On the other hand, the calculation time of most models increases when the data's artificial missing rate increases.Accordingly, MRNN is the model with the fastest computation time, followed by SAITS, Transformer, BRITS, and MICE in Fig 5c.6.2.5 Cau Giay dataset.The experimental results on the Cau Giay District air quality dataset